30 research outputs found

    Improving consensus structure by eliminating averaging artifacts

    Background: Common structural biology methods (e.g., NMR and molecular dynamics) often produce ensembles of molecular structures. Consequently, averaging the 3D coordinates of molecular structures (proteins and RNA) is a frequent approach to obtaining a consensus structure that is representative of the ensemble. However, when the structures are averaged, artifacts can result in unrealistic local geometries, including unphysical bond lengths and angles. Results: Herein, we describe a method to derive representative structures while limiting the number of artifacts. Our approach is based on a Monte Carlo simulation technique that drives a starting structure (an extended or a 'close-by' structure) towards the 'averaged structure' using a harmonic pseudo energy function. To assess the performance of the algorithm, we applied our approach to Cα models of 1364 proteins generated by the TASSER structure prediction algorithm. The average RMSD of the refined model from the native structure for the set becomes worse by a mere 0.08 Å compared to the average RMSD of the averaged structures from the native structure (3.28 Å for refined structures and 3.36 Å for the averaged structures). However, the percentage of atoms involved in clashes is greatly reduced (from 63% to 1%); in fact, the majority of the refined proteins had zero clashes. Moreover, a small number (38) of refined structures had a lower RMSD to the native protein than the averaged structure. Finally, compared to PULCHRA [1], our approach produces representative structures of similar RMSD quality, but with far fewer clashes. Conclusion: The benchmarking results demonstrate that our approach for removing averaging artifacts can be very beneficial for the structural biology community. Furthermore, the same approach can be applied to almost any problem where averaging of 3D coordinates is performed. Namely, structure averaging is also commonly performed in RNA secondary structure prediction [2], which could also benefit from our approach.
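
The idea of driving a starting structure toward the averaged coordinates with a harmonic pseudo energy can be illustrated with a minimal Metropolis Monte Carlo sketch. This is not the authors' implementation: the bond-length restraint, the 3.8 Å ideal Cα-Cα distance, the force constants, the temperature, and all function names are assumptions chosen for illustration, and the toy chains are made up.

```python
import math
import random

BOND = 3.8  # assumed ideal Calpha-Calpha distance in angstroms


def pseudo_energy(coords, target, k_pos=1.0, k_bond=10.0):
    """Harmonic pull toward the averaged coordinates plus an assumed bond restraint."""
    energy = 0.0
    for p, q in zip(coords, target):
        energy += k_pos * sum((a - b) ** 2 for a, b in zip(p, q))
    for p, q in zip(coords, coords[1:]):
        energy += k_bond * (math.dist(p, q) - BOND) ** 2
    return energy


def refine(start, target, steps=5000, step_size=0.3, temperature=0.5, seed=0):
    """Metropolis Monte Carlo; returns the lowest-energy structure visited."""
    rng = random.Random(seed)
    coords = [list(p) for p in start]
    energy = pseudo_energy(coords, target)
    best, best_energy = [p[:] for p in coords], energy
    for _ in range(steps):
        i = rng.randrange(len(coords))
        old = coords[i][:]
        # Propose a small random displacement of one atom.
        coords[i] = [x + rng.uniform(-step_size, step_size) for x in old]
        new_energy = pseudo_energy(coords, target)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best, best_energy = [p[:] for p in coords], energy
        else:
            coords[i] = old  # reject the move
    return best, best_energy


# Toy data: an "averaged" chain with compressed bonds and an extended start.
averaged = [(2.5 * i, 0.0, 0.0) for i in range(8)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]
refined, refined_energy = refine(extended, averaged)
```

The restraint term is what keeps local geometry physical while the positional term pulls the chain toward the average; a production energy function would also need terms for clashes and angles.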

    Einstein-Roscoe regression for the slag viscosity prediction problem in steelmaking

    In classical machine learning, regressors are trained without attempting to gain insight into the mechanism connecting inputs and outputs. The natural sciences, however, are interested in finding a robust, interpretable function for the target phenomenon that can return predictions even outside the training domain. This paper focuses on the viscosity prediction problem in steelmaking and proposes Einstein-Roscoe regression (ERR), which learns the coefficients of the Einstein-Roscoe equation and is able to extrapolate to unseen domains. In addition, it is often the case in the natural sciences that some measurements are unavailable or more expensive than others due to physical constraints. To address this, we employ a transfer learning framework based on Gaussian processes, which allows us to estimate the regression parameters using auxiliary measurements available at a reasonable cost. In experiments using viscosity measurements in a high-temperature slag suspension system, ERR compares favorably with various machine learning approaches in interpolation settings and outperforms all of them in extrapolation settings. Furthermore, after estimating parameters using an auxiliary dataset obtained at room temperature, an increase in accuracy is observed on the high-temperature dataset, which corroborates the effectiveness of the proposed approach.
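
For orientation, the Einstein-Roscoe equation is commonly written as eta = eta0 * (1 - a*f)^(-n), where f is the solid fraction and eta0 is the viscosity of the liquid phase. The sketch below fits eta0 and n by ordinary least squares on the log-transformed equation with a held fixed. It is a simplified stand-in and does not reproduce the Gaussian-process transfer learning described in the abstract; the default values a = 1.35 and n = 2.5 are the ones usually quoted for uniform spheres, and the function names and synthetic data are assumptions.

```python
import math


def einstein_roscoe(eta0, f, a=1.35, n=2.5):
    """Suspension viscosity: eta = eta0 * (1 - a*f)**(-n)."""
    return eta0 * (1.0 - a * f) ** (-n)


def fit_eta0_n(fractions, viscosities, a=1.35):
    """Fit ln(eta) = ln(eta0) + n * (-ln(1 - a*f)) by least squares, a fixed."""
    xs = [-math.log(1.0 - a * f) for f in fractions]
    ys = [math.log(v) for v in viscosities]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx
    eta0 = math.exp(mean_y - n * mean_x)
    return eta0, n


# Synthetic, noise-free data generated from known parameters (not from the paper).
fractions = [0.0, 0.05, 0.1, 0.2, 0.3]
viscosities = [einstein_roscoe(0.5, f) for f in fractions]
eta0_hat, n_hat = fit_eta0_n(fractions, viscosities)

# Extrapolate to a solid fraction outside the fitted range.
eta_extrapolated = einstein_roscoe(eta0_hat, 0.4, n=n_hat)
```

Because the functional form is fixed by physics, the fitted curve stays meaningful beyond the sampled solid fractions, which is the property the abstract contrasts with black-box regressors.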

    pLMSNOSite: an ensemble-based approach for predicting protein S-nitrosylation sites by integrating supervised word embedding and embedding from pre-trained protein language model

    Background: Protein S-nitrosylation (SNO) plays a key role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as an important mechanism for regulating protein functions and cell signaling of all main classes of protein. It is involved in several biological processes including immune response, protein stability, transcription regulation, post-translational regulation, DNA damage repair, and redox regulation, and is an emerging paradigm of redox signaling for protection against oxidative stress. The development of robust computational tools to predict protein SNO sites would contribute to further interpretation of the pathological and physiological mechanisms of SNO. Results: Using an intermediate fusion-based stacked generalization approach, we integrated embeddings from a supervised embedding layer and a contextualized protein language model (ProtT5) and developed a tool called pLMSNOSite (protein language model-based SNO site predictor). On an independent test set of experimentally identified SNO sites, pLMSNOSite achieved values of 0.340, 0.735 and 0.773 for MCC, sensitivity and specificity, respectively. These results show that pLMSNOSite performs better than the compared approaches for the prediction of S-nitrosylation sites. Conclusion: Together, the experimental results suggest that pLMSNOSite achieves significant improvement in the prediction performance of S-nitrosylation sites and represents a robust computational approach for predicting protein S-nitrosylation sites. pLMSNOSite could be a useful resource for further elucidation of SNO and is publicly available at https://github.com/KCLabMTU/pLMSNOSite
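
The three reported metrics all derive from a binary confusion matrix. The sketch below uses their standard definitions; the counts are hypothetical and are not taken from the paper. They are chosen only to show how an imbalanced test set can combine sensitivity and specificity in the 0.7 range with a much lower MCC.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (MCC, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, sensitivity, specificity


# Hypothetical counts: 100 positive sites and 1000 negative sites.
mcc, sensitivity, specificity = binary_metrics(tp=73, fp=227, tn=773, fn=27)
```

With these made-up counts the sensitivity is 0.73 and the specificity 0.773, while the MCC is about 0.32, because false positives from the large negative class outnumber the true positives.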

    DFT-aided machine learning-based discovery of magnetism in Fe-based bimetallic chalcogenides

    With the technological advances of recent years and the widespread use of magnetism in every sector of current technology, the search for a low-cost magnetic material is more important than ever. The discovery of magnetism in alternative materials, such as metal chalcogenides with abundant atomic constituents, would be a milestone in this context. However, considering the multitude of possible chalcogenide configurations, predictive computational modeling or experimental synthesis is an open challenge. Here, we resort to a stacked generalization machine learning model to predict the magnetic moment (µB) in hexagonal Fe-based bimetallic chalcogenides, Fe_xA_yB, where A represents Ni, Co, Cr, or Mn, B represents S, Se, or Te, and x and y represent the concentrations of the respective atoms. The stacked generalization model is trained on a dataset obtained using first-principles density functional theory. The model achieves MSE, MAE, and R² values of 1.655 µB², 0.546 µB, and 0.922, respectively, on an independent test set, indicating that our model predicts the composition-dependent magnetism in bimetallic chalcogenides with a high degree of accuracy. A generalized algorithm is also developed to test the universality of our proposed model for any concentration of Ni, Co, Cr, or Mn up to 62.5% in bimetallic chalcogenides.

    Improving protein succinylation sites prediction using embeddings from protein language model

    Protein succinylation is an important post-translational modification (PTM) responsible for many vital metabolic activities in cells, including cellular respiration, regulation, and repair. Here, we present a novel approach that combines features from supervised word embedding with embedding from a protein language model called ProtT5-XL-UniRef50 (hereafter termed, ProtT5) in a deep learning framework to predict protein succinylation sites. To our knowledge, this is one of the first attempts to employ embedding from a pre-trained protein language model to predict protein succinylation sites. The proposed model, dubbed LMSuccSite, achieves state-of-the-art results compared to existing methods, with performance scores of 0.36, 0.79, 0.79 for MCC, sensitivity, and specificity, respectively. LMSuccSite is likely to serve as a valuable resource for exploration of succinylation and its role in cellular physiology and disease

    Preventing Discriminatory Decision-making in Evolving Data Streams

    Bias in machine learning has rightly received significant attention over the last decade. However, most fair machine learning (fair-ML) work to address bias in decision-making systems has focused solely on the offline setting. Despite the wide prevalence of online systems in the real world, work on identifying and correcting bias in the online setting is severely lacking. The unique challenges of the online environment make addressing bias more difficult than in the offline setting. First, Streaming Machine Learning (SML) algorithms must deal with the constantly evolving real-time data stream. Second, they need to adapt to changing data distributions (concept drift) to make accurate predictions on new incoming data. Adding fairness constraints to this already complicated task is not straightforward. In this work, we focus on the challenges of achieving fairness in biased data streams while accounting for the presence of concept drift, accessing one sample at a time. We present Fair Sampling over Stream (FS²), a novel fair rebalancing approach capable of being integrated with SML classification algorithms. Furthermore, we devise the first unified performance-fairness metric, Fairness Bonded Utility (FBU), to evaluate and compare the trade-off between performance and fairness of different bias mitigation methods efficiently. FBU simplifies the comparison of fairness-performance trade-offs of multiple techniques through one unified and intuitive evaluation, allowing model designers to easily choose a technique. Overall, extensive evaluations show our measures surpass those of other fair online techniques previously reported in the literature.

    L1pred: A Sequence-Based Prediction Tool for Catalytic Residues in Enzymes with the L1-logreg Classifier

    To understand enzyme functions, identifying the catalytic residues is usually a first step. Moreover, knowledge about catalytic residues is also useful for protein engineering and drug design. However, experimentally identifying catalytic residues remains challenging for reasons of time and cost. Therefore, computational methods have been explored to predict catalytic residues. Here, we developed a new algorithm, L1pred, for catalytic residue prediction, using the L1-logreg classifier to integrate eight sequence-based scoring functions. We tested L1pred and compared it against several existing sequence-based methods on the carefully designed datasets Data604 and Data63. With ten-fold cross-validation, L1pred showed an area under the precision-recall curve (AUPR) and an area under the ROC curve (AUC) of 0.2198 and 0.9494, respectively, on the training dataset, Data604. In addition, on the independent test dataset, Data63, it showed AUPR and AUC values of 0.2636 and 0.9375, respectively. Compared with other sequence-based methods, L1pred showed the best performance on both datasets. We also analyzed the importance of each attribute in the algorithm and found that all the scores contributed more or less equally to the L1pred performance.

    RF-NR: Random Forest Based Approach for Improved Classification of Nuclear Receptors


    Bioinformatic Analyses of Peroxiredoxins and RF-Prx: A Random Forest-Based Predictor and Classifier for Prxs

    Peroxiredoxins (Prxs) are a protein superfamily, present in all organisms, that play a critical role in protecting cellular macromolecules from oxidative damage but also regulate intracellular and intercellular signaling processes involving redox-regulated proteins and pathways. Bioinformatic approaches using computational tools that focus on active site-proximal sequence fragments (known as active site signatures) and iterative clustering and searching methods (referred to as TuLIP and MISST) have recently enabled the recognition of over 38,000 peroxiredoxins, as well as their classification into six functionally relevant groups. With these data providing so many examples of Prxs in each class, machine learning approaches offer an opportunity to extract additional information about features characteristic of these protein groups. In this study, we developed a novel computational method named “RF-Prx” based on a random forest (RF) approach integrated with K-space amino acid pairs (KSAAP) to identify peroxiredoxins and classify them into one of six subgroups. Our process performed in a superior manner compared to other machine learning classifiers. Thus the RF approach integrated with K-space amino acid pairs enabled the detection of class-specific conserved sequences outside the known functional centers and with potential importance. For example, drugs designed to target Prx proteins would likely suffer from cross-reactivity among distinct Prxs if targeted to conserved active sites, but this may be avoidable if remote, class-specific regions could be targeted instead
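
A k-spaced amino acid pair encoding counts residue pairs separated by k intervening positions. The abstract does not state the exact variant used by RF-Prx, so the sketch below follows the common composition-style convention (frequencies over the 400 ordered pairs for each k from 0 to k_max); the function name, the value of k_max, and the normalization are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kspaced_pair_features(seq, k_max=2):
    """Frequencies of ordered residue pairs with k residues between them, k = 0..k_max."""
    features = {}
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        windows = len(seq) - k - 1  # number of pairs available at this spacing
        for i in range(max(windows, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip non-standard residues such as X
                counts[pair] += 1
        for pair, count in counts.items():
            features[f"{pair}_k{k}"] = count / windows if windows > 0 else 0.0
    return features


# Toy peptide, not a real peroxiredoxin fragment.
features = kspaced_pair_features("ACACA", k_max=1)
```

Each sequence becomes a fixed-length vector of 400 × (k_max + 1) values, which is the kind of input a random forest classifier can consume directly.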

    Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains

    Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors
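
The self-alignment search described above is naturally parallel, because each circular shift can be evaluated independently. The sketch below shows circular permutation of a residue sequence, one possible way to divide the shifts among workers, and the parallel efficiency implied by the reported speedup. The round-robin partitioning is an assumption; the abstract does not say how Parallel-SymD distributes its work.

```python
def circular_permutation(residues, shift):
    """Rotate the sequence so that position `shift` becomes the first residue."""
    shift %= len(residues)
    return residues[shift:] + residues[:shift]


def partition_shifts(n_residues, n_workers):
    """Assign shifts 1..n_residues-1 to workers round-robin (assumed scheme)."""
    return [list(range(1 + w, n_residues, n_workers)) for w in range(n_workers)]


def parallel_efficiency(speedup, n_processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_processors


rotated = circular_permutation("ABCDE", 2)
assignments = partition_shifts(509, 100)
efficiency = parallel_efficiency(63, 100)
```

For the 509-residue example, a speedup of 63 on 100 processors corresponds to a parallel efficiency of 0.63.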